
Analysis Motivation

The goal of this data analysis is to explore the relationship between asthma hospitalizations, as a measure of the human impact of air pollution, and the amount of green space in different California counties. We hope to learn whether counties with higher asthma hospitalization rates tend to have lower proportions of green space.

Asthma hospitalization rates are not a perfect measure of air pollution, but the two are strongly linked, and county-level asthma hospitalization data for California is publicly available. Within these rates, we look specifically at age group and race to assess how hospitalization rates vary across groups. The racial makeup of a county is often correlated with socioeconomic status, so we examine whether areas with a larger POC (people of color) population have both higher hospitalization rates and less green space, two factors that can themselves correlate with lower socioeconomic status. It’s important to note, however, that race is not an exact metric for socioeconomic status. Our observations may have implications for how the socioeconomic level of a county is linked to asthma hospitalizations or green space, but this study does not conclusively establish that link.

To assess each county’s green space, we use two measures: the proportion of county area that is parkland and the number of parks per county. Both are included because they capture green space in different ways, and considering area helps account for the fact that parks vary drastically in size. Parkland data is an imperfect measure of tree or plant coverage: urban parks can contain few plants, while rural regions can have large areas of plant coverage in private hands, which can improve air quality despite not being open to the public. However, parks data is publicly available and gives a good general idea of how many parks most residents of a county can access, as well as the area these public green spaces cover. Additional data breaking all land cover into percentages of grass, forest, building, and street cover (or similar categories) would be helpful, but it was not available at the time of this analysis.

Some of the major questions we are interested in answering include: Does a larger amount of open green space or a higher number of public parks correlate with lower asthma hospitalization rates across California counties? How does the racial and age makeup of hospitalizations differ across counties? What is the relationship between the racial makeup of asthma hospitalizations and the amount of green space in a county? What is the relationship between the age makeup of hospitalizations and green space? Does the relationship between number of open parks and asthma hospitalization rate differ from the relationship between proportion of park land and asthma hospitalization rate, and which of the two serves as the better predictor?

While exploring the data, we first examined how open green space and number of parks differed across California counties. We focused on three variables: open park land, number of parks, and proportion of park land to total county land. We discovered that although most counties devote only about 10% of their land to parks, many counties contain a large number of outlier areas. These areas are census tracts, each containing between 1,200 and 8,000 people, so some specific areas within a county have much more open green space than the county average. We also looked at the overall relationship between open park land and age-adjusted hospitalization rate by county but did not see a strong correlation.
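One way outlier tracts like these could be flagged is the 1.5 × IQR rule commonly used for box plots. The sketch below is only an illustration on invented toy data: the `park_proportion` column, the `upper_fence` and `is_outlier` names, and the choice of rule are all assumptions, not a record of how the outliers in this analysis were identified.

```r
library(dplyr)

# Toy tract-level data; every value here is invented for illustration.
tracts <- data.frame(
  county_name = rep(c("A", "B"), each = 6),
  park_proportion = c(0.08, 0.10, 0.09, 0.11, 0.10, 0.60,
                      0.05, 0.06, 0.06, 0.05, 0.06, 0.06)
)

# Within each county, flag tracts whose park proportion lies above
# Q3 + 1.5 * IQR (hypothetical rule; the report does not state one).
flagged <- tracts %>%
  group_by(county_name) %>%
  mutate(
    upper_fence = quantile(park_proportion, 0.75) + 1.5 * IQR(park_proportion),
    is_outlier = park_proportion > upper_fence
  ) %>%
  ungroup()

# Count of flagged tracts per county
outliers_per_county <- flagged %>%
  group_by(county_name) %>%
  summarize(n_outlier_tracts = sum(is_outlier))
```

Computing the fence within each county, rather than across the whole state, keeps a tract from being flagged simply because its county is greener overall.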

Exploratory Data Analysis

First, we load our datasets and mutate them to add columns with summary statistics for each county (code not shown for brevity).

##  [1] "asthma_parks_race"             "asthmaCA_kids"                
##  [3] "asthmaCA_kids_2"               "asthmaCA_kids_v_adults"       
##  [5] "asthmaCA_race_ethnicity"       "clean_parks_data"             
##  [7] "parks_asthmaCA_kids"           "parks_asthmaCA_kids_2"        
##  [9] "parks_asthmaCA_kids_v_adults"  "parks_asthmaCA_race_ethnicity"
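Although the loading code is omitted, a minimal sketch of the kind of per-county summary columns described above might look like the following. The toy data, the assumption that each county's hospitalization count is repeated on every tract row, and the `county_park_proportion` name are all hypothetical; `county_number_hospitalizations` and `county_open_park_area` match names used in the modeling code later in this report.

```r
library(dplyr)

# Toy tract-level data; values are invented, and the county-level
# hospitalization count is assumed to repeat on every tract row.
tracts <- data.frame(
  county_name = c("A", "A", "A", "B", "B"),
  number_hospitalizations = c(120, 120, 120, 45, 45),
  total_open_park_area_sqmiles = c(0.5, 1.0, 0.0, 2.0, 4.0),
  tract_area_sqmiles = c(5, 10, 5, 20, 40)
)

# mutate() keeps one row per tract and attaches the county-level
# summaries as new columns, rather than collapsing the data.
county_stats <- tracts %>%
  group_by(county_name) %>%
  mutate(
    county_number_hospitalizations = first(number_hospitalizations),
    county_open_park_area = sum(total_open_park_area_sqmiles),
    county_park_proportion = county_open_park_area / sum(tract_area_sqmiles)
  ) %>%
  ungroup()
```

Using `mutate()` rather than `summarize()` here keeps the tract rows available for tract-level plots while still exposing county totals for coloring or modeling.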

With the data loaded, we look at overall trends across counties in open park area, number of asthma hospitalizations, and age-adjusted hospitalization rate.

Overall Plot of County Tract Open Park Area by County and colored by County Tract Number of Hospitalizations

Overall Plot of County Open Park Area by County and colored by County’s Number of Hospitalizations

Next, we’ll facet the data by age to get a sense of trends for each age group (kids ages 0-17 or adults 18+).

Faceted by Kids v. Adult: Age-Adjusted Hospitalization Rate by County and colored by County Open Park Area
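For readers unfamiliar with faceting, the sketch below shows roughly how a kids-versus-adults plot of this kind can be built with `facet_wrap()`. The data frame, its values, and the `age_group` column name are invented for illustration and are not the report's data or its actual plotting code.

```r
library(ggplot2)

# Toy county-by-age-group data; all values are invented.
county_age <- data.frame(
  county_name = rep(c("A", "B", "C"), times = 2),
  age_group = rep(c("Kids (0-17)", "Adults (18+)"), each = 3),
  age_adjusted_hospitalization_rate = c(9.1, 6.4, 12.3, 4.2, 3.8, 7.5),
  county_open_park_area = rep(c(1.5, 6.0, 0.8), times = 2)
)

# One panel per age group; color encodes county open park area.
p <- ggplot(county_age,
            aes(x = county_name,
                y = age_adjusted_hospitalization_rate,
                color = county_open_park_area)) +
  geom_point(size = 3) +
  facet_wrap(~ age_group) +
  labs(x = "County",
       y = "Age-Adjusted Hospitalization Rate",
       color = "County Open Park Area")
```

Printing `p` draws the two panels side by side on a shared y-axis, which makes the rates for the two age groups directly comparable.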

Next, we created maps (using a combination of the USAboundaries, sf, and tmap packages) to visualize asthma hospitalization rates across counties for each race/ethnicity group. Since there was hardly any data for the AI/AN category, we do not include it in our maps. Note also: these maps were originally going to be featured in our interactive, but there were some issues using tmap with Shiny. Since there are only three different maps, we decided to feature them here instead, as they are still important visualizations of our data!

Map of California Counties and Age-Adjusted Hospitalization Rate for White Individuals

Map of California Counties and Age-Adjusted Hospitalization Rate for Hispanic Individuals

Map of California Counties and Age-Adjusted Hospitalization Rate for Black Individuals

Additionally, in all of these maps, Fresno County (the yellow county in the center of the state) stands out as a county of interest yet again, as it did earlier in our EDA: it has the highest asthma hospitalization rate for each race/ethnicity group. Imperial County (at the very bottom of the state) also appears to have a high asthma hospitalization rate compared to other counties.


DATA: parks_asthmaCA_race_ethnicity

library(dplyr)

# Collapse to one row per census tract. first() keeps a single county name per
# tract; passing the whole county_name column returned duplicate rows per tract,
# which inflated the county sums below.
parks_data_by_tract <- parks_asthmaCA_race_ethnicity %>%
  group_by(tractcode) %>%
  summarize(
    open_parks_tract = mean(open_parks_tract),
    tract_area_sqmiles = mean(tract_area_sqmiles),
    total_open_park_area_sqmiles = mean(total_open_park_area_sqmiles),
    county_name = first(county_name)
  )

# Sum the tract-level values to county totals
parks_data_by_county <- parks_data_by_tract %>%
  group_by(county_name) %>%
  summarize(
    total_county_area_sqm = sum(tract_area_sqmiles),
    total_county_park_area_sqm = sum(total_open_park_area_sqmiles),
    county_num_parks = sum(open_parks_tract)
  )
# County-level hospitalizations for each race/ethnicity group
hospitalizations_white <- parks_asthmaCA_race_ethnicity %>%
  filter(race_ethnicity == "White") %>%
  group_by(county_name) %>%
  summarize(number_hospitalizations_white = mean(number_hospitalizations))

hospitalizations_black <- parks_asthmaCA_race_ethnicity %>%
  filter(race_ethnicity == "Black") %>%
  group_by(county_name) %>%
  summarize(number_hospitalizations_black = mean(number_hospitalizations))

hospitalizations_hispanic <- parks_asthmaCA_race_ethnicity %>%
  filter(race_ethnicity == "Hispanic") %>%
  group_by(county_name) %>%
  summarize(number_hospitalizations_hispanic = mean(number_hospitalizations))

hospitalizations_asian_pi <- parks_asthmaCA_race_ethnicity %>%
  filter(race_ethnicity == "Asian/PI") %>%
  group_by(county_name) %>%
  summarize(number_hospitalizations_asian_pi = mean(number_hospitalizations))

hospitalizations_ai_an <- parks_asthmaCA_race_ethnicity %>%
  filter(race_ethnicity == "AI/AN") %>%
  group_by(county_name) %>%
  summarize(number_hospitalizations_ai_an = mean(number_hospitalizations))

# Hospitalizations for all non-White groups combined, from the asthma-only data
hospitalizations_poc <- asthmaCA_race_ethnicity %>%
  filter(STRATA_NAME != "White") %>%
  group_by(COUNTY) %>%
  summarize(number_hospitalizations_poc = sum(NUMBER_OF_HOSPITALIZATIONS))

# Total hospitalizations per county across all groups
hospitalizations_total <- asthmaCA_race_ethnicity %>%
  group_by(COUNTY) %>%
  summarise(NUMBER_OF_HOSPITALIZATIONS = sum(NUMBER_OF_HOSPITALIZATIONS))
# Join the county-level park totals with each hospitalization summary.
# As the AI/AN category values were all 0 or NA, we decided it was not useful to keep analyzing that information.
CA_county_asthma_parks <- parks_data_by_county %>%
  full_join(hospitalizations_white, by = "county_name") %>%
  full_join(hospitalizations_poc, by = c("county_name" = "COUNTY")) %>%
  full_join(hospitalizations_hispanic, by = "county_name") %>%
  full_join(hospitalizations_asian_pi, by = "county_name") %>%
  full_join(hospitalizations_black, by = "county_name") %>%
  full_join(hospitalizations_total, by = c("county_name" = "COUNTY"))
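The per-group filter-and-join steps above could likely be condensed into a single grouped summary followed by a reshape. The sketch below illustrates that idea on invented toy data and assumes the tidyr package is available. It is an alternative formulation, not the code used in this analysis, and the resulting column names keep the original capitalization of `race_ethnicity` (for example `number_hospitalizations_White`) rather than the lowercase names used above.

```r
library(dplyr)
library(tidyr)

# Toy data with the county value repeated on each tract row; values are invented.
toy <- data.frame(
  county_name = c("A", "A", "A", "A", "B", "B", "B", "B"),
  race_ethnicity = c("White", "White", "Black", "Black",
                     "White", "White", "Hispanic", "Hispanic"),
  number_hospitalizations = c(30, 30, 12, 12, 8, 8, 20, 20)
)

# One summary for every county/group pair, then one column per group.
wide <- toy %>%
  group_by(county_name, race_ethnicity) %>%
  summarize(number_hospitalizations = mean(number_hospitalizations),
            .groups = "drop") %>%
  pivot_wider(
    names_from = race_ethnicity,
    values_from = number_hospitalizations,
    names_prefix = "number_hospitalizations_"
  )
```

Groups with no rows for a county come out as `NA`, the same result a `full_join()` gives for a missing match.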

DATA: parks_asthmaCA_kids

library(ggplot2)
library(plotly)

# County totals, attached to every original row (mutate does not collapse rows)
sums <- parks_asthmaCA_kids %>%
  group_by(county_name) %>%
  mutate(
    total_tract_area = sum(tract_area_sqmiles),
    total_open_park_area = sum(total_open_park_area_sqmiles),
    num_open_parks = sum(open_parks_tract)
  )

# County averages, computed the same way
avgs <- parks_asthmaCA_kids %>%
  group_by(county_name) %>%
  mutate(
    avg_tract_area = mean(tract_area_sqmiles),
    avg_open_park_area = mean(total_open_park_area_sqmiles),
    avg_num_open_parks = mean(open_parks_tract)
  )

# Number of open parks vs. number of hospitalizations
ggplotly(
  sums %>%
    ggplot(aes(num_open_parks, number_hospitalizations, col = county_name)) +
    geom_point() +
    labs(x = "Count of Open Parks", y = "Hospitalizations", col = "County Name")
)

# Number of open parks vs. age-adjusted hospitalization rate
ggplotly(
  sums %>%
    ggplot(aes(num_open_parks, age_adjusted_hospitalization_rate, col = county_name)) +
    geom_point() +
    labs(x = "Count of Open Parks", y = "Age-Adjusted Hospitalization Rate", col = "County Name")
)

# Open park area vs. number of hospitalizations, colored by hospitalization rate
ggplotly(
  sums %>%
    ggplot(aes(x = total_open_park_area, y = number_hospitalizations, col = age_adjusted_hospitalization_rate)) +
    geom_jitter() +
    labs(x = "Open Park Area (sq. miles)", y = "Hospitalizations", col = "Hospitalization Rate")
)

# Open park area vs. number of hospitalizations, colored by county
ggplotly(
  sums %>%
    ggplot(aes(x = total_open_park_area, y = number_hospitalizations, col = county_name)) +
    geom_jitter() +
    labs(x = "Open Park Area (sq. miles)", y = "Hospitalizations", col = "County Name")
)


Modeling and Inference

parks_asthmaCA_kids_v_adults_countystats <- parks_asthmaCA_kids_v_adults_countystats %>%
  filter(!is.na(county_number_hospitalizations), !is.na(county_open_park_area))

# Simple linear regression of county hospitalizations on county open park area
mod1 <- lm(county_number_hospitalizations ~ county_open_park_area,
           data = parks_asthmaCA_kids_v_adults_countystats)
beta <- coef(mod1)

# Scatterplot with the fitted regression line overlaid in red
parks_asthmaCA_kids_v_adults_countystats %>%
  ggplot(aes(x = county_open_park_area, y = county_number_hospitalizations)) +
  geom_point() +
  geom_abline(intercept = beta[1], slope = beta[2], color = "red")
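To move from the fitted line to inference, and to address which green space measure is the better predictor, one could inspect the slope's p-value and compare the two single-predictor models. The sketch below uses invented toy data, so its numbers say nothing about the actual California results. The `counties` data frame and the `mod_area`, `mod_count`, and `comparison` names are hypothetical, and adjusted R-squared and AIC are only one reasonable choice of comparison.

```r
# Toy county-level data; all values are invented for illustration.
counties <- data.frame(
  county_open_park_area = c(1, 2, 3, 4, 5, 6, 7, 8),
  county_num_parks = c(12, 5, 9, 30, 14, 22, 8, 17),
  county_number_hospitalizations = c(52, 47, 45, 41, 36, 33, 30, 24)
)

# One simple linear model per candidate predictor
mod_area <- lm(county_number_hospitalizations ~ county_open_park_area, data = counties)
mod_count <- lm(county_number_hospitalizations ~ county_num_parks, data = counties)

# Slope estimate and its p-value for the park-area model
coef_table <- summary(mod_area)$coefficients
slope <- coef_table["county_open_park_area", "Estimate"]
slope_p <- coef_table["county_open_park_area", "Pr(>|t|)"]

# Side-by-side fit statistics: higher adjusted R-squared and lower AIC
# indicate the better-fitting single-predictor model.
comparison <- data.frame(
  predictor = c("county_open_park_area", "county_num_parks"),
  adj_r_squared = c(summary(mod_area)$adj.r.squared,
                    summary(mod_count)$adj.r.squared),
  aic = c(AIC(mod_area), AIC(mod_count))
)
```

Because both models share the same response and the same rows, their AIC values are directly comparable. With real data, rows with missing values would need to be dropped consistently for both models first.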

Flaws and Limitations